Skip to content

Conversation

kdelee
Copy link
Member

@kdelee kdelee commented Aug 12, 2025

  • What is being changed?
    This commit introduces a ObservabilityMiddleware that provides request tracing, performance profiling, and secure SQL query commenting. This consists of three components (ProfileRequestMiddleware, SQLProfilingMiddleware, TraceContextMiddleware) to simplify testing and maintenance, with a single public middleware for apps to consume.

  • Why is this change needed?
    The goal is to provide a consistent way for any django-ansible-base consumer to gain performance and debugging insights. Many of these features were previously implemented in AWX, but not available in other apps.

  • How does this change address the issue?
    It creates a single middleware (ObservabilityMiddleware) that developers can add to the top of their MIDDLEWARE list. This ensures the correct ordering of tracing, profiling, and SQL commenting.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test update
  • Refactoring (no functional changes)
  • Development environment change
  • Configuration change

Self-Review Checklist

  • I have performed a self-review of my code
  • I have added relevant comments to complex code sections
  • I have updated documentation where needed
  • I have considered the security impact of these changes
  • I have considered performance implications
  • I have thought about error handling and edge cases
  • I have tested the changes in my local environment

Testing Instructions

Prerequisites

  • In your Django settings (e.g., test_app/defaults.py), ensure the new middleware is at the top of the MIDDLEWARE list:

1 MIDDLEWARE = [
2 'ansible_base.lib.middleware.observability.ObservabilityMiddleware',
3 ...
4 ]

  • To test all features, enable the following settings:

1 ANSIBLE_BASE_CPROFILE_REQUESTS = True
2 ANSIBLE_BASE_SQL_PROFILING = True

Note: ANSIBLE_BASE_CPROFILE_REQUESTS has significant performance implications and is intended for live debugging, not for permanent use in production.
  • To verify the SQL comments, you must have access to the database logs (e.g., by setting log_statement=all or enabling slow query logging in postgresql.conf).
  • The SQL commenting feature is most effective when combined with slow query logging, which can be configured to log only a percentage of requests to manage overhead.

Steps to Test

  1. Include the middleware in the test_app
  2. Enable ANSIBLE_BASE_CPROFILE_REQUESTS and ANSIBLE_BASE_SQL_PROFILING settings
  3. Enable log_statement=all in the postgres
  4. Start the application.
  5. Make an API request to any endpoint without providing a request ID:
curl -v -H "X-Request-ID: $(uuidgen)" http://localhost:8000/api/v1/

Expected Results

  • The response headers should include X-API-Time, X-API-Node, X-API-CProfile-File, and X-Request-ID.
  • In the database logs, you should see SQL statements prepended with a secure, sanitized comment, for example: /* trace_id='...', route='...', origin='...' */.
  • The X-Request-ID header in the response should contain a newly generated UUID.

Additional Context

This work is a generic implementation of the profiling method originally used by @chrismeyersfsu for diagnosing https://issues.redhat.com/browse/AAP-50642.

Required Actions

  • Requires documentation updates
    • The README.md for the middleware has been updated.
  • Requires downstream repository changes
    • controller, eda-server, and gateway will need to update their MIDDLEWARE settings to replace the three old middleware classes with the new ansible_base.lib.middleware.observability.ObservabilityMiddleware.
  • Requires infrastructure/deployment changes
  • Requires coordination with other teams
  • Blocked by PR/MR: #XXX

replaces #794

@kdelee
Copy link
Member Author

kdelee commented Aug 13, 2025

Manual Verification in aap-dev

To ensure the ObservabilityMiddleware works correctly in AAP, I manually tested my changes within the aap-dev local development environment.
My process involved applying the modifications from django-ansible-base to the controller, eda, and gateway components and then inspecting both API responses and database logs to confirm the behavior.

1. Applying the Changes

I made changes to eda, gateway, and controller to include the middleware and enable the settings. The make aap command took care of building the images with the updated django-ansible-base since I had the sources configured.
Here is example gateway change, for the other components we would only need the settings/middleware, this one also includes changes to grpc: https://github.com/ansible-automation-platform/aap-gateway/compare/devel...kdelee:aap-gateway:add_dab_middleware?expand=1

2. Verifying End-to-End Trace Propagation

I made requests to the me endpoint of each service, captured the X-Request-ID from the response, and then traced that ID all the way to the database logs.
This test relies on the ANSIBLE_BASE_SQL_PROFILING setting being enabled, which activates the SQL comment injection feature of the ObservabilityMiddleware.


EDA Verification

Step 1: Make API Request to EDA

curl -i -u "admin:r9hwurZIPH1U9FlUWl2SuZMOUnGuBIY6" http://localhost:44926/api/eda/v1/users/me/

Response Headers:

HTTP/1.1 200 OK
x-request-id: 29700e74-7812-4678-833c-2e5760e9434d
x-api-query-count: 4
x-api-query-time: 0.010s
x-api-time: 0.056s
x-api-node: Unknown
x-api-cprofile-file: /tmp/cprofile-29700e74-7812-4678-833c-2e5760e9434d.prof
x-envoy-upstream-service-time: 70
x-grpc-auth-time: 0.478s
x-grpc-auth-cprofile-file: /tmp/cprofile-29700e74-7812-4678-833c-2e5760e9434d.prof
x-grpc-auth-node: myaap-gateway-7cd9fdd4f8-875gd

The X-Request-ID I received was 29700e74-7812-4678-833c-2e5760e9434d.

Step 2: Trace EDA Request in Database Logs

kubectl logs myaap-postgres-15-0 -n aap26 | grep "29700e74-7812-4678-833c-2e5760e9434d"

PostgreSQL Log Output:

2025-08-13 18:19:37.918 UTC [62881] LOG:  statement: /* trace_id='29700e74-7812-4678-833c-2e5760e9434d', route='api/eda/v1/users/me/', origin='request' */ SELECT "core_user"."id", "core_user"."password", ...
2025-08-13 18:19:37.927 UTC [62881] LOG:  statement: /* trace_id='29700e74-7812-4678-833c-2e5760e9434d', route='api/eda/v1/users/me/', origin='request' */ SELECT "dab_rbac_roleuserassignment"."id", ...
2025-08-13 18:19:37.938 UTC [62881] LOG:  statement: /* trace_id='29700e74-7812-4678-833c-2e5760e9434d', route='api/eda/v1/users/me/', origin='request' */ SELECT "dab_resource_registry_resource"."id", ...
2025-08-13 18:19:37.941 UTC [62881] LOG:  statement: /* trace_id='29700e74-7812-4678-833c-2e5760e9434d', route='api/eda/v1/users/me/', origin='request' */ SELECT "django_content_type"."id", ...

Controller Verification

Step 1: Make API Request to Controller

curl -i -u "admin:r9hwurZIPH1U9FlUWl2SuZMOUnGuBIY6" http://localhost:44926/api/controller/v2/me/

Response Headers:

x-api-node: myaap-controller-web-766675647f-z79k8
x-api-time: 0.266s
content-language: en
x-api-query-count: 14
x-api-query-time: 0.020s
x-api-cprofile-file: /tmp/cprofile-92a780e1-2c1b-40e7-8b69-69182687190d.prof
x-request-id: 92a780e1-2c1b-40e7-8b69-69182687190d
x-api-request-id: 6cb8b20c100d4dccb24e85e814448427
x-envoy-upstream-service-time: 283
x-grpc-auth-time: 0.540s
x-grpc-auth-cprofile-file: /tmp/cprofile-92a780e1-2c1b-40e7-8b69-69182687190d.prof
x-grpc-auth-node: myaap-gateway-7cd9fdd4f8-875gd

The X-Request-ID I received was 92a780e1-2c1b-40e7-8b69-69182687190d.

Step 2: Trace Controller Request in Database Logs

kubectl logs myaap-postgres-15-0 -n aap26 | grep "92a780e1-2c1b-40e7-8b69-69182687190d"

PostgreSQL Log Output:

2025-08-13 18:19:49.314 UTC [62922] LOG:  statement: /* trace_id='92a780e1-2c1b-40e7-8b69-69182687190d', route='api/controller/%28%3FP%3Cversion%3E%28v2%29%29/me/%24', origin='request' */ SELECT "auth_user"."id", ...
2025-08-13 18:19:49.322 UTC [62922] LOG:  statement: /* trace_id='92a780e1-2c1b-40e7-8b69-69182687190d', route='api/controller/%28%3FP%3Cversion%3E%28v2%29%29/me/%24', origin='request' */ SELECT "dab_rbac_roleuserassignment"."id", ...
... (and 12 more lines)

Gateway Verification

Step 1: Make API Request to Gateway

curl -i -u "admin:r9hwurZIPH1U9FlUWl2SuZMOUnGuBIY6" http://localhost:44926/api/gateway/v1/me/

Response Headers:

x-api-node: myaap-gateway-7cd9fdd4f8-875gd
x-api-time: 0.480s
x-frame-options: DENY
x-content-type-options: nosniff
referrer-policy: same-origin
cross-origin-opener-policy: same-origin
x-api-query-count: 43
x-api-query-time: 0.060s
x-api-cprofile-file: /tmp/cprofile-1a802ac2-f155-4082-8398-5383afb0f80c.prof
x-request-id: 1a802ac2-f155-4082-8398-5383afb0f80c
x-envoy-upstream-service-time: 499
x-grpc-auth-time: 0.045s
x-grpc-auth-cprofile-file: /tmp/cprofile-1a802ac2-f155-4082-8398-5383afb0f80c.prof
x-grpc-auth-node: myaap-gateway-7cd9fdd4f8-875gd

The X-Request-ID I received was 1a802ac2-f155-4082-8398-5383afb0f80c.

Step 2: Trace Gateway Request in Database Logs

kubectl logs myaap-postgres-15-0 -n aap26 | grep "1a802ac2-f155-4082-8398-5383afb0f80c"

PostgreSQL Log Output:

2025-08-13 18:20:00.876 UTC [62959] LOG:  statement: /* trace_id='1a802ac2-f155-4082-8398-5383afb0f80c', route='api/gateway/v1/me/', origin='request' */ SELECT "dab_authentication_authenticator"."modified" FROM "dab_authentication_authenticator" ...
2025-08-13 18:20:00.881 UTC [62959] LOG:  statement: /* trace_id='1a802ac2-f155-4082-8398-5383afb0f80c', route='api/gateway/v1/me/', origin='request' */ SELECT "aap_gateway_api_user"."id", ...
... (and 41 more lines)

My successful grep for each service confirms that the trace_id from my initial web request was propagated through the ObservabilityMiddleware, picked up by the SQL profiling mechanism, and injected as a comment into every database query for that request. This demonstrates a complete, end-to-end trace from the API entry point to the database.

I'll link the other branches used.

Analyzing cProfile Output

With the ANSIBLE_BASE_CPROFILE_REQUESTS setting enabled, the middleware generates a .prof file for each request. The file path and the pod that generated it are available in the response headers.

Step 1: Identify the Pod and File Path

From the Controller verification step, the headers give us the necessary information:

  • x-api-node: myaap-controller-web-766675647f-z79k8
  • x-api-cprofile-file: /tmp/cprofile-92a780e1-2c1b-40e7-8b69-69182687190d.prof

Step 2: Copy the Profile File from the Pod

I used kubectl cp to copy the file from the pod to my local machine for analysis.

kubectl cp aap26/myaap-controller-web-766675647f-z79k8:/tmp/cprofile-92a780e1-2c1b-40e7-8b69-69182687190d.prof ./controller-me.prof

Step 3: Analyze the Profile Data

The .prof file can be analyzed with various tools.

Using snakeviz (for a graphical view):

snakeviz ./controller-me.prof

This opens a browser window with an interactive visualization of the call stack, making it easy to spot performance bottlenecks.

Using pstats (built-in Python profiler):
For a quick, terminal-based analysis, I used the pstats module.

import pstats

p = pstats.Stats('./controller-me.prof')
# Sort by cumulative time spent in functions and print the top 15
p.sort_stats('cumulative').print_stats(15)

This provides a detailed breakdown of function calls, time spent, and the number of calls, which is essential for deep performance analysis.

Reflection

This end-to-end tracing provides powerful insights into the full lifecycle of a request. By inspecting the headers and logs, we can answer critical performance questions:

  • By default, the ObservabilityMiddleware provides the x-api-time header shows us how long was spent in the target upstream service code. It also provides the x-grpc-auth-time header shows us exactly how long was spent on gRPC authentication before the request even hit the upstream service's code.
  • With ANSIBLE_BASE_SQL_PROFILING enabled X-API-Query-Time and X-API-Query-Count headers give us immediate feedback on how database-intensive a particular endpoint is.
  • With ANSIBLE_BASE_CPROFILE_REQUESTS enabled, we can get a complete picture of the request's execution, including time spent in Python code outside of database queries. This allows us to pinpoint performance bottlenecks in application logic, not just in the database.
  • Finally, the ability to correlate every single database query back to a specific trace_id is invaluable for debugging complex issues where multiple services and background tasks are involved.

kdelee added 7 commits August 28, 2025 16:32
This commit introduces `ProfileRequestMiddleware` to provide performance
insights for API requests, migrating and generalizing functionality
from AWX's `TimingMiddleware` and `AWXProfiler`. This makes the
functionality available to any consumer of `django-ansible-base`.

The middleware always adds an `X-API-Time` header to the response,
indicating the total time taken to process the request. To help locate
profiling data, it also adds an `X-API-Node` header with the cluster
host ID if one is not already present in the response.

When the `ANSIBLE_BASE_CPROFILE_REQUESTS` setting is enabled, the
middleware will also perform a cProfile analysis for each request.
The resulting `.prof` file is saved to a temporary directory on the
node that served the request, and its path is returned in the
`X-API-CProfile-File` response header.

For the most accurate and reliable timing, it is recommended to add
this middleware to the top of the `MIDDLEWARE` list in your settings.

The core profiling logic is encapsulated in the `DABProfiler` class,
which can be imported and used directly for profiling non-HTTP contexts,
such as background tasks.

This commit also includes a `README.md` file with documentation for
the new middleware and profiler, including usage examples and tests in
the test_app.
This commit refactors the SQL query profiling logic out of the AWX
controller's API view and into a reusable middleware in the
django-ansible-base library. This makes the feature available to all
AAP components, such as the gateway and EDA.

SQLProfilingMiddleware adds `X-API-Query-Count` and
`X-API-Query-Time` headers to API responses.

Changes include:
- A new `SQLProfilingMiddleware` class in the profiling module.
- Configuration is controlled by a new, namespaced setting,
  `ANSIBLE_BASE_SQL_PROFILING`, with a fallback to the legacy
  `SQL_DEBUG` setting for backward compatibility.
- The middleware logs a warning if it is enabled while Django's
  `DEBUG` setting is `False`, as query logging is disabled in that
  state.
- Unit tests have been added to ensure correct functionality and prevent regressions.
- The profiling documentation has been updated to include usage
  instructions for the new middleware.
This enhances the `SQLProfilingMiddleware` to provide deeper observability by integrating it with a new request tracing context system.

It introduces `TraceContextMiddleware`, which establishes a unique `trace_id` for each request using `contextvars`. This trace ID is then leveraged by the `SQLProfilingMiddleware` to inject a detailed comment, including the trace ID, route, and origin, into every SQL query. This allows for direct correlation between a web request and its database activity.

Key changes include:
- A new `TraceContextMiddleware` that sets and safely resets the request context.
- The `SQLProfilingMiddleware` is updated to inject contextual SQL comments.
- A new, isolated test suite for the `TraceContextMiddleware` to ensure it correctly handles the `X-Request-ID` header and prevents context bleeding between requests.
- Existing tests were made more robust, and a state pollution issue was fixed by ensuring proper context cleanup in the test suite.
Refactors the existing `trace_context` manager to align its behavior with the
`TraceContextMiddleware`, ensuring consistent and safe observability for
non-web tasks.

This commit addresses these issues by:
- Modifying `trace_context` to only accept `origin` and `trace_id`.
- Adding validation to ensure any provided `trace_id` is a valid UUID, matching the middleware's logic.
- Removing the ability to create arbitrary context variables via `**kwargs`.

Additionally, this change introduces:
- Tests for the `trace_context` manager, covering its functionality, validation, thread-safety, and use as a decorator.
- Documentation in the `README.md` with usage instructions and an example for background tasks.
@kdelee kdelee force-pushed the shared_middleware_with_trace_sql branch from 8039edc to e7f5ad5 Compare August 28, 2025 20:33
Copy link

DVCS PR Check Results:

Could not find JIRA key(s) in PR title, branch name, or commit messages

Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
2 New issues

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant